Overview

E-commerce is a trending business sector, of electronically buying and selling products over the internet. This fastgrowing industry has also adapted features like online payments, internet marketing, door-step delivery, returns and exchanges, etc. Since it is a widespread industry that adds to human convenience and luxuries, there is a lot of scope for improvements and business growth, which brings profits to both consumers and industry.

Understanding the problem

Description of the dataset

This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. The Dataset contains multiple tables providing information on actual ecommerce purchases along with various other dimensions of a particular purchase like Customer Information, Seller Information, Product Metadata as well as Customer reviews on their purchase experience. The dataset also has geolocation file that relates Brazilian zip codes to latitude/longitude coordinates.

Order status of customers of each state

custplot <- ordertable  %>%
  filter(order_status %in% c("delivered", "canceled"))%>%
  ggplot(aes(x=customer_state, fill=order_status)) +
  geom_bar(show.legend=FALSE)+facet_wrap(~order_status, scales = "free")+
  labs(x="Customer states in brazil", y="No. of orders of different status for customers in different states" , title = "Order status plot of customers of different state")
ggplotly(custplot)

Add Title

g <- ggplot() + 
  geom_sf(data=state, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) 
print(g)

#all states
a <- g + geom_point(ordergeotable, mapping = aes(x=geolocation_lng, y=geolocation_lat, color = customer_state), position = "jitter", size=.15, alpha=1/2)+
  coord_sf(xlim = c(-75,-30), ylim = c(-40,5), expand = FALSE)+
  labs(y="Latitude of location.", x="Longitude of location.",title="Mapping of different states of brazil in Map")

plot(a)

Add Title

freq_count <- as.data.frame(table(ordertable$customer_unique_id))
odtbl <- ordertable %>% select_at(vars(year, month, customer_unique_id)) %>%
  distinct_at(vars(year,month, customer_unique_id)) %>% 
  arrange_at(vars(year, month)) %>% 
  count_(vars(year, month)) %>% 
  ggplot() + geom_line(mapping = aes(x = month, y = n, group = 1, color="red"),show.legend = FALSE) +facet_wrap(~year)+geom_point(mapping = aes(x = month, y = n, group = 1, color="red",show.legend=FALSE),show.legend = FALSE)+labs(x="Month", y="Count of new customers added each month", title = "New customers purchased every consecutive month")

ggplotly(odtbl)

Overall sales based on holidays

a <- items_data %>% left_join(orders_data) %>% mutate(mny = format(strptime(order_purchase_timestamp, "%Y-%m-%d %H:%M:%S"),'%Y-%m')) %>% group_by(mny) %>% summarise(total = sum(price)) %>% arrange(mny)

brazil_holidays <- brazil_holidays_data %>% mutate(mny = strftime(Date, format = "%Y-%m")) %>% group_by(mny) %>% mutate(holidays_by_week = paste0(Holiday, collapse = ",")) %>% select(mny, holidays_by_week)

brazil_holidays <- brazil_holidays[!duplicated(brazil_holidays$mny),]

p <- plot_ly(a, x = ~mny, y = ~total, type = 'scatter', mode = 'lines')

p <- p %>%
  add_trace(
    type = 'bar',
    x = brazil_holidays$mny, 
    y = 1000000,
    text = brazil_holidays$holidays_by_week,
    hoverinfo = 'text',
    marker = list(color='yellow'),
    showlegend = F,
    width = 0.3
  ) %>% layout(xaxis = list(autotick = F, dtick = 1)) %>%
  layout(title = 'Monthly sale along with yearly holidays in Brazil',
         xaxis = list(title = 'Time (Year - Month)',
                      autotick = F, dtick = 1),
         yaxis = list(title = 'Total Purchase (in $)'))

p

Clearly, the sale increases significantly when there is an event during a particular month.

Hour by day transactions

order_weekday <- orders_data %>% mutate(purchase_weekday = wday(order_purchase_timestamp), purchase_hour = format(strptime(order_purchase_timestamp, "%Y-%m-%d %H:%M:%S"),'%H')) %>%  group_by(purchase_weekday, purchase_hour) %>% summarise(total_transactions = n())

p <- plot_ly(data = order_weekday, 
             x = ~purchase_hour,
             y = ~purchase_weekday,
             z = ~total_transactions,
             type = "heatmap",
             width = 1050,
             height = 500,
             colors = colorRamp(c("white","yellow", "red")))  %>%
      layout(title = 'Transactions over the hour by day',
             xaxis = list(title = 'Hour'),
             yaxis = list(title = 'Day', tickvals = c(1, 2, 3, 4, 5, 6, 7), ticktext = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")),
             legend = list(title = "Total Transactions")) %>% 
  add_annotations(x = order_weekday$purchase_hour, y = order_weekday$purchase_weekday, text = order_weekday$total_transactions, xref = 'x', yref = 'y', showarrow = FALSE, font=list(color='black'))
p

The heatmap indiciates that the majority of the transactions occur over the weekdays during office hours. We can thus increases our marketing strategies during this period.

Total payment vs Total Transactions based on payment methods

payment_sum <- payments_data %>% filter(payment_type != "not_defined") %>% group_by(payment_type) %>% summarise(sum = sum(payment_value))
payment_count <- payments_data %>% filter(payment_type != "not_defined") %>% group_by(payment_type) %>% summarise(count = n())
p <- plot_ly() %>%
add_pie(data = payment_count, labels = ~payment_type, values = ~count, domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
add_pie(data = payment_sum, labels = ~payment_type, values = ~sum, domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
layout(title = "Number of payments vs Total payment values", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

p

The market in brazil is heavily dominated by credit card payments. Credit cards are used in ~74% of the total transactions amounting to ~78% of the total market revenue for Olist.

Payment type growth analysis

payment_group <- payments_data %>% filter(payment_type != "not_defined") %>% group_by(order_ID, payment_type) %>% summarise(count = n())
payment_order_group <- payment_group %>% left_join(orders_data) %>% select(order_ID, payment_type, count, order_purchase_timestamp) %>% mutate(purchase_mny = format(strptime(order_purchase_timestamp, "%Y-%m-%d %H:%M:%S"),'%Y-%m'))
payment_abc <- payment_order_group %>% group_by(purchase_mny, payment_type) %>% summarise(total_count = n())%>% ungroup()

p <- plot_ly(payment_abc, x = ~purchase_mny, y = ~total_count, color = ~payment_type, type = 'scatter', mode = 'lines+markers') %>% layout(title = 'Payment type growth monthly',
       xaxis = list(title = 'Time (Year - Month)'),
       yaxis = list(title = 'Total number of transactions'))
p